Skip to content

Add timeout and in-flight observability to OperationTimedOut#818

Open
mykaul wants to merge 2 commits intoscylladb:masterfrom
mykaul:better_timeout_print
Open

Add timeout and in-flight observability to OperationTimedOut#818
mykaul wants to merge 2 commits intoscylladb:masterfrom
mykaul:better_timeout_print

Conversation

@mykaul
Copy link
Copy Markdown

@mykaul mykaul commented Apr 14, 2026

Summary

Improve timeout observability in the Python driver, inspired by the Go driver PR scylladb/gocql#847.

  • Commit 1: Fix unfilled %s format string in add_execution_profile timeout message (cherry-pickable)
  • Commit 2: Add timeout and in_flight fields to OperationTimedOut

Changes

OperationTimedOut enhancement (cassandra/__init__.py)

  • New optional timeout and in_flight keyword parameters (backward compatible)
  • Exception message appends (timeout=Xs, in_flight=N) when timeout is set
  • When timeout is None or 0, no timeout info is printed

All raise sites updated (cassandra/connection.py, cassandra/cluster.py)

  • All 7 production raise sites now pass timeout= and in_flight= where available

Tests

  • 7 new unit tests for OperationTimedOut message formatting and attribute access
  • Updated existing heartbeat and response future tests to assert timeout and in_flight values

Fixes: DRIVER-538

The error message at Cluster.add_execution_profile() had an unfilled %s
placeholder: 'Failed to create all new connection pools in the %ss timeout.'
The pool_wait_timeout value was never interpolated into the string, so
users would see a literal '%s' instead of the actual timeout value.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
@mykaul
Copy link
Copy Markdown
Author

mykaul commented Apr 15, 2026

Code looks reasonable to me. Main question - do we want to log as warning, and not as debug, the timeout logs?

@mykaul mykaul marked this pull request as ready for review April 15, 2026 10:14
Copy link
Copy Markdown

@Lorak-mmk Lorak-mmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 problems:

  • Putting additional info in exceptions is fine, and is the correct way to get such info to users. Logging such things is not. Those are log on execution path (= spammy), printed in situations that can occur during normal operations. Let's keep the info in exception, it will be enough, and it was always how we got such info to user.
  • IIUC in_flight parameter is supposed to help verify if the timeout was encountered because of slow response from server, or because of wait on available stream on a connection. in_flight seems like not a great way to do this. More direct way would be to imo measure a time it actually took to get a stream id in borrow_connection function.

@mykaul
Copy link
Copy Markdown
Author

mykaul commented Apr 15, 2026

Thanks @Lorak-mmk , see below:

2 problems:

  • Putting additional info in exceptions is fine, and is the correct way to get such info to users. Logging such things is not. Those are log on execution path (= spammy), printed in situations that can occur during normal operations. Let's keep the info in exception, it will be enough, and it was always how we got such info to user.

OK, I'll remove.

  • IIUC in_flight parameter is supposed to help verify if the timeout was encountered because of slow response from server, or because of wait on available stream on a connection. in_flight seems like not a great way to do this. More direct way would be to imo measure a time it actually took to get a stream id in borrow_connection function.

I think it should go into a separate item. Should I remove the 'in flight' for the time being, if we think it's not useful enough? What's the downside of having it, btw? Misleading?

@roydahan
Copy link
Copy Markdown
Collaborator

@mykaul mykaul force-pushed the better_timeout_print branch from 94fe0b7 to 56b57c0 Compare April 15, 2026 14:35
Improve timeout observability in the driver, inspired by the Go driver
PR scylladb/gocql#847.

OperationTimedOut now carries optional timeout and in_flight fields that
are appended to the exception message when present (e.g.
"(timeout=10.0s, in_flight=42)"). All seven production raise sites in
connection.py and cluster.py pass these values where available.

Additionally, debug-level log lines are emitted for:
- Client-side request timeouts (host, timeout, in_flight, orphaned)
- Server-side read/write timeouts (host, consistency, received/required,
  data_retrieved/write_type, retry decision)

A helper _retry_decision_name() translates RetryPolicy constants to
human-readable strings for the log messages.

New keyword-only parameters are backward compatible — existing callers
that pass only positional errors/last_host continue to work unchanged.

Fixes: DRIVER-538

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
@mykaul
Copy link
Copy Markdown
Author

mykaul commented Apr 15, 2026

@mykaul mykaul requested a review from Lorak-mmk April 16, 2026 06:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants